Codeswitching language identification using Subword Information Enriched Word Vectors

Author

  • Meng Xuan Xia
Abstract

Codeswitching is a widely observed phenomenon among bilingual speakers. By combining subword information enriched word vectors with a linear-chain Conditional Random Field, we develop a supervised machine learning model that identifies languages in English-Spanish codeswitched tweets. Our computational method achieves a tweet-level weighted F1 of 0.83 and a token-level accuracy of 0.949 without using any external resource. The result demonstrates that named entity recognition remains a challenge in codeswitched texts and warrants further work.
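The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the 3 to 6 character n-gram range (borrowed from common fastText defaults), the `en`/`es` labels, and all scores are assumptions chosen for illustration. The first function shows how a word is broken into boundary-marked character n-grams, the kind of subword unit that enriches a word vector. The second decodes the best label sequence for a linear-chain model from per-token scores and label-to-label transition scores, which in a trained CRF would be learned rather than hand-set.

```python
from typing import Dict, List, Tuple


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return character n-grams of a word wrapped in '<' and '>' markers.

    The n-gram range is an assumption, not a value taken from the paper.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Find the highest-scoring label sequence for a linear-chain model.

    emissions[t][label] is a hypothetical per-token score; transitions
    maps (previous_label, current_label) to a score, defaulting to 0.0.
    """
    if not emissions:
        return []
    # best[t][label]: score of the best path ending in `label` at position t
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []  # back[t - 1][label]: best previous label for `label` at t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(char_ngrams("casa", 3, 4))
    # Hypothetical scores for a three-token tweet
    toy_emissions = [
        {"en": 2.0, "es": 0.0},
        {"en": 0.6, "es": 0.4},
        {"en": 0.0, "es": 2.0},
    ]
    toy_transitions = {
        ("en", "en"): 0.5, ("es", "es"): 0.5,
        ("en", "es"): -0.5, ("es", "en"): -0.5,
    }
    print(viterbi(toy_emissions, toy_transitions, ["en", "es"]))
```

The transition scores here reward staying in the same language and penalize switching, which is one plausible way a sequence model can use context that a per-token classifier cannot.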

Similar resources

Word-Level Language Identification and Predicting Codeswitching Points in Swahili-English Language Data

Codeswitching is a very common behavior among Swahili speakers, but of the little computational work done on Swahili, none has focused on codeswitching. This paper addresses two tasks relating to Swahili-English codeswitching: word-level language identification and prediction of codeswitch points. Our two-step model achieves high accuracy at labeling the language of words using a simple feature...

Better Word Embeddings for Korean

Vector representations of words that capture semantic and syntactic information accurately are critical for the performance of models that use these vectors as inputs. Algorithms that only use the surrounding context at the word level ignore the subword-level relationships, which carry important meaning, especially for highly inflected languages such as Korean. In this paper we compare th...

Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching

We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages with which the documents in the corpus are expressed, and does not rely on parallel corpora to constrain the sp...

Sentence Boundary Detection for French with Subword-Level Information Vectors and Convolutional Neural Networks

In this work we tackle the problem of sentence boundary detection applied to French as a binary classification task ("sentence boundary" or "not sentence boundary"). We combine convolutional neural networks with subword-level information vectors, which are word embedding representations learned from Wikipedia that take advantage of word morphology, so each word is represented as a bag of t...

Subword lexical modelling for speech recognition

In this work, we introduce and develop a novel framework, angie, for modelling subword lexical phenomena in speech recognition. Our framework provides a flexible and powerful mechanism for capturing morphology, syllabification, phonology, and other subword effects in a hierarchical manner which maximizes sharing of subword structures. Angie models the subword structure within a context-free grammar...

Publication date: 2016